{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###### Path to notebook: [https://www.github.com/microsoft/presidio/blob/main/docs/samples/python/example_structured.ipynb](https://www.github.com/microsoft/presidio/blob/main/docs/samples/python/example_structured.ipynb)"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [],
   "source": [
    "from presidio_structured import StructuredEngine, JsonAnalysisBuilder, PandasAnalysisBuilder, StructuredAnalysis, CsvReader, JsonReader, JsonDataProcessor, PandasDataProcessor"
   ],
   "execution_count": null
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This sample showcases presidio-structured on structured and semi-structured data containing sensitive data like names, emails, and addresses. It differs from the sample for the batch analyzer/anonymizer engines example, which includes narrative phrases that might contain sensitive data. The presence of personal data embedded in these phrases requires to analyze and to anonymize the text inside the cells, which is not the case for our structured sample, where the sensitive data is already separated into columns."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading in data"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "sample_df = CsvReader().read(\"./csv_sample_data/test_structured.csv\")\n",
    "sample_df"
   ],
   "outputs": [
    {
     "data": {
      "text/plain": [
       "   id           name                      email       street       city state  \\\n",
       "0   1       John Doe       john.doe@example.com  123 Main St    Anytown    CA   \n",
       "1   2     Jane Smith     jane.smith@example.com   456 Elm St  Somewhere    TX   \n",
       "2   3  Alice Johnson  alice.johnson@example.com  789 Pine St  Elsewhere    NY   \n",
       "\n",
       "              non_pii  \n",
       "0        reallynotpii  \n",
       "1       reallynotapii  \n",
       "2  reallynotapiiatall  "
      ],
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>name</th>\n",
       "      <th>email</th>\n",
       "      <th>street</th>\n",
       "      <th>city</th>\n",
       "      <th>state</th>\n",
       "      <th>non_pii</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>John Doe</td>\n",
       "      <td>john.doe@example.com</td>\n",
       "      <td>123 Main St</td>\n",
       "      <td>Anytown</td>\n",
       "      <td>CA</td>\n",
       "      <td>reallynotpii</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Jane Smith</td>\n",
       "      <td>jane.smith@example.com</td>\n",
       "      <td>456 Elm St</td>\n",
       "      <td>Somewhere</td>\n",
       "      <td>TX</td>\n",
       "      <td>reallynotapii</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>Alice Johnson</td>\n",
       "      <td>alice.johnson@example.com</td>\n",
       "      <td>789 Pine St</td>\n",
       "      <td>Elsewhere</td>\n",
       "      <td>NY</td>\n",
       "      <td>reallynotapiiatall</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 13
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "sample_json = JsonReader().read(\"./sample_data/test_structured.json\")\n",
    "sample_json"
   ],
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'id': 1,\n",
       " 'name': 'John Doe',\n",
       " 'email': 'john.doe@example.com',\n",
       " 'address': {'street': '123 Main St',\n",
       "  'city': 'Anytown',\n",
       "  'state': 'CA',\n",
       "  'non_pii': 'reallynotapiiatall'}}"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 14
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "# contains nested objects in lists\n",
    "sample_complex_json = JsonReader().read(\"./sample_data/test_structured_complex.json\")\n",
    "sample_complex_json"
   ],
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'users': [{'id': 1,\n",
       "   'name': 'John Doe',\n",
       "   'email': 'john.doe@example.com',\n",
       "   'address': {'street': '123 Main St',\n",
       "    'city': 'Anytown',\n",
       "    'state': 'CA',\n",
       "    'non_pii': 'reallynotpii'}},\n",
       "  {'id': 2,\n",
       "   'name': 'Jane Smith',\n",
       "   'email': 'jane.smith@example.com',\n",
       "   'address': {'street': '456 Elm St',\n",
       "    'city': 'Somewhere',\n",
       "    'state': 'TX',\n",
       "    'non_pii': 'reallynotapii'}},\n",
       "  {'id': 3,\n",
       "   'name': 'Alice Johnson',\n",
       "   'email': 'alice.johnson@example.com',\n",
       "   'address': {'street': '789 Pine St',\n",
       "    'city': 'Elsewhere',\n",
       "    'state': 'NY',\n",
       "    'non_pii': 'reallynotapiiatall'}}]}"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 15
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tabular (csv) data: defining & generating tabular analysis, anonymization."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-11-06T10:03:24.281936Z",
     "start_time": "2024-11-06T10:03:23.485232Z"
    }
   },
   "source": [
    "# Automatically detect the entity for the columns\n",
    "tabular_analysis = PandasAnalysisBuilder().generate_analysis(sample_df)\n",
    "tabular_analysis"
   ],
   "outputs": [
    {
     "data": {
      "text/plain": [
       "StructuredAnalysis(entity_mapping={'name': 'PERSON', 'email': 'URL', 'city': 'LOCATION', 'state': 'LOCATION'})"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 16
  },
  {
   "cell_type": "code",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-11-06T10:03:24.950846Z",
     "start_time": "2024-11-06T10:03:24.943507Z"
    }
   },
   "source": [
    "# anonymized data defaults to be replaced with None, unless operators is specified\n",
    "\n",
    "pandas_engine = StructuredEngine(data_processor=PandasDataProcessor())\n",
    "df_to_be_anonymized = sample_df.copy() # in-place anonymization\n",
    "anonymized_df = pandas_engine.anonymize(df_to_be_anonymized, tabular_analysis, operators=None) # explicit None for clarity\n",
    "anonymized_df"
   ],
   "outputs": [
    {
     "data": {
      "text/plain": [
       "   id    name   email       street    city   state             non_pii\n",
       "0   1  <None>  <None>  123 Main St  <None>  <None>        reallynotpii\n",
       "1   2  <None>  <None>   456 Elm St  <None>  <None>       reallynotapii\n",
       "2   3  <None>  <None>  789 Pine St  <None>  <None>  reallynotapiiatall"
      ],
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>name</th>\n",
       "      <th>email</th>\n",
       "      <th>street</th>\n",
       "      <th>city</th>\n",
       "      <th>state</th>\n",
       "      <th>non_pii</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>&lt;None&gt;</td>\n",
       "      <td>&lt;None&gt;</td>\n",
       "      <td>123 Main St</td>\n",
       "      <td>&lt;None&gt;</td>\n",
       "      <td>&lt;None&gt;</td>\n",
       "      <td>reallynotpii</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>&lt;None&gt;</td>\n",
       "      <td>&lt;None&gt;</td>\n",
       "      <td>456 Elm St</td>\n",
       "      <td>&lt;None&gt;</td>\n",
       "      <td>&lt;None&gt;</td>\n",
       "      <td>reallynotapii</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>&lt;None&gt;</td>\n",
       "      <td>&lt;None&gt;</td>\n",
       "      <td>789 Pine St</td>\n",
       "      <td>&lt;None&gt;</td>\n",
       "      <td>&lt;None&gt;</td>\n",
       "      <td>reallynotapiiatall</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 17
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### We can also define operators using OperatorConfig similar as to the AnonymizerEngine:"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-11-06T10:03:26.304380Z",
     "start_time": "2024-11-06T10:03:26.295648Z"
    }
   },
   "source": [
    "from presidio_anonymizer.entities.engine import OperatorConfig\n",
    "from faker import Faker\n",
    "fake = Faker()\n",
    "\n",
    "operators = {\n",
    "    \"PERSON\": OperatorConfig(\"replace\", {\"new_value\": \"person...\"}),\n",
    "    \"EMAIL_ADDRESS\": OperatorConfig(\"custom\", {\"lambda\": lambda x: fake.safe_email()})\n",
    "    # etc...\n",
    "    }\n",
    "anonymized_df = pandas_engine.anonymize(sample_df, tabular_analysis, operators=operators)\n",
    "anonymized_df"
   ],
   "outputs": [
    {
     "data": {
      "text/plain": [
       "   id       name   email       street    city   state             non_pii\n",
       "0   1  person...  <None>  123 Main St  <None>  <None>        reallynotpii\n",
       "1   2  person...  <None>   456 Elm St  <None>  <None>       reallynotapii\n",
       "2   3  person...  <None>  789 Pine St  <None>  <None>  reallynotapiiatall"
      ],
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>name</th>\n",
       "      <th>email</th>\n",
       "      <th>street</th>\n",
       "      <th>city</th>\n",
       "      <th>state</th>\n",
       "      <th>non_pii</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>person...</td>\n",
       "      <td>&lt;None&gt;</td>\n",
       "      <td>123 Main St</td>\n",
       "      <td>&lt;None&gt;</td>\n",
       "      <td>&lt;None&gt;</td>\n",
       "      <td>reallynotpii</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>person...</td>\n",
       "      <td>&lt;None&gt;</td>\n",
       "      <td>456 Elm St</td>\n",
       "      <td>&lt;None&gt;</td>\n",
       "      <td>&lt;None&gt;</td>\n",
       "      <td>reallynotapii</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>person...</td>\n",
       "      <td>&lt;None&gt;</td>\n",
       "      <td>789 Pine St</td>\n",
       "      <td>&lt;None&gt;</td>\n",
       "      <td>&lt;None&gt;</td>\n",
       "      <td>reallynotapiiatall</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 18
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Semi-structured (JSON) data: simple and complex analysis, anonymization"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-11-06T10:03:28.449039Z",
     "start_time": "2024-11-06T10:03:27.723798Z"
    }
   },
   "source": [
    "json_analysis = JsonAnalysisBuilder().generate_analysis(sample_json)\n",
    "json_analysis"
   ],
   "outputs": [
    {
     "data": {
      "text/plain": [
       "StructuredAnalysis(entity_mapping={'name': 'PERSON', 'email': 'EMAIL_ADDRESS', 'address.city': 'LOCATION'})"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 19
  },
  {
   "cell_type": "code",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-11-06T10:03:29.979160Z",
     "start_time": "2024-11-06T10:03:29.204211Z"
    }
   },
   "source": [
    "# Currently does not support nested objects in lists\n",
    "try:\n",
    "    json_complex_analysis = JsonAnalysisBuilder().generate_analysis(sample_complex_json)\n",
    "except ValueError as e:\n",
    "    print(e)\n",
    "\n",
    "# however, we can define it manually:\n",
    "json_complex_analysis = StructuredAnalysis(entity_mapping={\n",
    "    \"users.name\":\"PERSON\",\n",
    "    \"users.address.street\":\"LOCATION\",\n",
    "    \"users.address.city\":\"LOCATION\",\n",
    "    \"users.address.state\":\"LOCATION\",\n",
    "    \"users.email\": \"EMAIL_ADDRESS\",\n",
    "})"
   ],
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Analyzer.analyze_iterator only works on primitive types (int, float, bool, str). Lists of objects are not yet supported.\n"
     ]
    }
   ],
   "execution_count": 20
  },
  {
   "cell_type": "code",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-11-06T10:03:40.550804Z",
     "start_time": "2024-11-06T10:03:40.546967Z"
    }
   },
   "source": [
    "# anonymizing simple data\n",
    "json_engine = StructuredEngine(data_processor=JsonDataProcessor())\n",
    "anonymized_json = json_engine.anonymize(sample_json, json_analysis, operators=operators)\n",
    "anonymized_json"
   ],
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'id': 1,\n",
       " 'name': 'person...',\n",
       " 'email': 'duarteangela@example.org',\n",
       " 'address': {'street': '123 Main St',\n",
       "  'city': '<None>',\n",
       "  'state': 'CA',\n",
       "  'non_pii': 'reallynotapiiatall'}}"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 21
  },
  {
   "cell_type": "code",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-11-06T10:03:41.922324Z",
     "start_time": "2024-11-06T10:03:41.917466Z"
    }
   },
   "source": [
    "anonymized_complex_json = json_engine.anonymize(sample_complex_json, json_complex_analysis, operators=operators)\n",
    "anonymized_complex_json"
   ],
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'users': [{'id': 1,\n",
       "   'name': 'person...',\n",
       "   'email': 'bmcfarland@example.org',\n",
       "   'address': {'street': '<None>',\n",
       "    'city': '<None>',\n",
       "    'state': '<None>',\n",
       "    'non_pii': 'reallynotpii'}},\n",
       "  {'id': 2,\n",
       "   'name': 'person...',\n",
       "   'email': 'bmcfarland@example.org',\n",
       "   'address': {'street': '<None>',\n",
       "    'city': '<None>',\n",
       "    'state': '<None>',\n",
       "    'non_pii': 'reallynotapii'}},\n",
       "  {'id': 3,\n",
       "   'name': 'person...',\n",
       "   'email': 'bmcfarland@example.org',\n",
       "   'address': {'street': '<None>',\n",
       "    'city': '<None>',\n",
       "    'state': '<None>',\n",
       "    'non_pii': 'reallynotapiiatall'}}]}"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 22
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": ""
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
